Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM
Trappen, Tim, Keßler, Robert, Pabel, Roland, Achter, Viktor, Wesner, Stefan
Due to rising demands for Artificial Intelligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. The utilisation of High-Performance Computing (HPC) has become a prevalent approach for the implementation of such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing dynamic AI application workloads. In this paper, we propose our solution that serves Large Language Models (LLMs) by integrating vLLM, Slurm and Kubernetes on the supercomputer \textit{RAMSES}. An initial benchmark indicates that the proposed architecture scales efficiently to 100, 500 and 1000 concurrent requests, incurring an end-to-end latency overhead of only approximately 500 ms.
- North America > United States (0.70)
- Asia (0.68)
- Europe > Germany > North Rhine-Westphalia (0.14)
- Information Technology (0.95)
- Education > Educational Setting (0.50)
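The kind of concurrency benchmark described in the abstract can be sketched with asyncio: issue N simultaneous requests and collect per-request end-to-end latencies. The request coroutine below is a stub standing in for a real call to an OpenAI-compatible vLLM endpoint; all names and timings are ours, not from the paper.

```python
import asyncio
import statistics
import time

async def fake_llm_request(prompt: str) -> str:
    # Stub for a real HTTP call to a vLLM OpenAI-compatible endpoint;
    # the short sleep stands in for network + inference time.
    await asyncio.sleep(0.01)
    return f"completion for: {prompt}"

async def run_benchmark(concurrency: int) -> dict:
    """Issue `concurrency` simultaneous requests and summarize latencies."""
    async def timed_call(i: int) -> float:
        start = time.perf_counter()
        await fake_llm_request(f"prompt {i}")
        return time.perf_counter() - start

    latencies = await asyncio.gather(*(timed_call(i) for i in range(concurrency)))
    return {
        "n": concurrency,
        "mean_s": statistics.mean(latencies),
        "p95_s": statistics.quantiles(latencies, n=20)[-1],  # 95th percentile
    }

result = asyncio.run(run_benchmark(100))
print(result["n"], round(result["mean_s"], 4))
```

Repeating the run at 100, 500, and 1000 concurrent requests, as in the paper's benchmark, only requires varying the `concurrency` argument.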
ClusterRCA: An End-to-End Approach for Network Fault Localization and Classification for HPC System
Sun, Yongqian, Pan, Xijie, Xiong, Xiao, Tao, Lei, Wang, Jiaju, Zhang, Shenglin, Yuan, Yuan, Li, Yuqi, Jian, Kunlin
Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios due to data heterogeneity and lack of accuracy. This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches: a failure graph is constructed based on the output of the state classifier, and ClusterRCA then performs a customized random walk on the graph to localize the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show ClusterRCA achieves high accuracy in diagnosing network failure for HPC systems. ClusterRCA also maintains robust performance across different application scenarios.
- North America > United States (0.14)
- Europe > United Kingdom (0.04)
- Europe > Sweden > Uppsala County > Uppsala (0.04)
- Asia > China > Tianjin Province > Tianjin (0.04)
- Energy (0.47)
- Telecommunications (0.47)
- Information Technology (0.46)
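The graph-based step can be illustrated with a toy random walk with restart over a weighted failure graph: nodes whose incoming anomaly edges carry high weight accumulate the most visits and are ranked as likely culprits. The graph, weights, and node names below are invented for illustration and are not ClusterRCA's actual construction.

```python
import random
from collections import defaultdict

def random_walk_localize(adj, steps=20000, restart=0.15, seed=42):
    """Rank nodes of a weighted failure graph by visit frequency under a
    random walk with restart; the most-visited node is the suspected culprit.
    `adj` maps node -> list of (neighbor, weight) pairs with weights > 0."""
    rng = random.Random(seed)
    nodes = list(adj)
    visits = defaultdict(int)
    cur = rng.choice(nodes)
    for _ in range(steps):
        visits[cur] += 1
        if rng.random() < restart or not adj[cur]:
            cur = rng.choice(nodes)  # restart at a uniformly random node
            continue
        nbrs, weights = zip(*adj[cur])
        cur = rng.choices(nbrs, weights=weights, k=1)[0]
    return sorted(visits, key=visits.get, reverse=True)

# Toy graph: anomaly-correlation edges point toward "nic-3", the planted culprit.
graph = {
    "nic-1": [("nic-3", 0.9), ("nic-2", 0.1)],
    "nic-2": [("nic-3", 0.9), ("nic-1", 0.1)],
    "nic-3": [("nic-1", 0.05), ("nic-2", 0.05)],
}
ranking = random_walk_localize(graph)
print(ranking[0])
```

With the walk strongly biased toward the planted culprit, `nic-3` dominates the visit counts and is ranked first.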
Improving the Efficiency of a Deep Reinforcement Learning-Based Power Management System for HPC Clusters Using Curriculum Learning
Budiarjo, Thomas, Pradata, Santana Yuda, Santiyuda, Kadek Gemilang, Amrizal, Muhammad Alfian, Pulungan, Reza, Takizawa, Hiroyuki
High energy consumption remains a key challenge in high-performance computing (HPC) systems, which often feature hundreds or thousands of nodes drawing substantial power even in idle or standby modes. Although powering down unused nodes can improve energy efficiency, choosing the wrong time to do so can degrade quality of service by delaying job execution. Machine learning, in particular reinforcement learning (RL), has shown promise in determining optimal times to switch nodes on or off. In this study, we enhance the performance of a deep reinforcement learning (DRL) agent for HPC power management by integrating curriculum learning (CL), a training approach that introduces tasks with gradually increasing difficulty. Using the Batsim-py simulation framework, we compare the proposed CL-based agent to both a baseline DRL method (without CL) and the conventional fixed-time timeout strategy. Experimental results confirm that an easy-to-hard curriculum outperforms other training orders in terms of reducing wasted energy usage. The best agent achieves a 3.73% energy reduction over the baseline DRL method and a 4.66% improvement compared to the best timeout configuration (shutdown every 15 minutes of idle time). In addition, it reduces average job waiting time by 9.24% and maintains a higher job-filling rate, indicating more effective resource utilization. Sensitivity tests across various switch-on durations, power levels, and cluster sizes further reveal the agent's adaptability to changing system parameters without retraining. These findings demonstrate that curriculum learning can significantly improve DRL-based power management in HPC, balancing energy savings, quality of service, and robustness to diverse configurations.
- Asia > Indonesia (0.28)
- Asia > Singapore (0.18)
- North America > United States > New York > New York County > New York City (0.14)
- (2 more...)
- Electrical Industrial Apparatus (0.82)
- Government > Regional Government > North America Government > United States Government (0.46)
- Energy > Oil & Gas > Upstream (0.34)
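For intuition about the fixed-timeout baseline the DRL agents are compared against, the sketch below tallies the energy such a policy wastes: a node idles at full idle power until the timeout fires, and shutting down incurs a fixed boot-overhead cost when the next job arrives. All quantities are hypothetical and not taken from the paper's experiments.

```python
def wasted_idle_energy(idle_gaps_min, timeout_min, idle_power_w, boot_overhead_wh):
    """Energy (Wh) wasted under a fixed-timeout shutdown policy, given a list
    of idle-gap durations in minutes between consecutive jobs on a node."""
    wasted = 0.0
    for gap in idle_gaps_min:
        if gap <= timeout_min:
            # Timeout never fires: the node idles for the whole gap.
            wasted += gap / 60 * idle_power_w
        else:
            # Node idles until the timeout, then pays the boot cost later.
            wasted += timeout_min / 60 * idle_power_w + boot_overhead_wh
    return wasted

# Hypothetical: two idle gaps (10 min and 60 min), 15-minute timeout,
# 100 W idle draw, 5 Wh boot overhead.
w = wasted_idle_energy([10, 60], timeout_min=15, idle_power_w=100, boot_overhead_wh=5)
print(round(w, 2))  # → 46.67 (Wh)
```

A learned policy improves on this baseline exactly when it predicts gap lengths well enough to shut down early on long gaps and stay up on short ones.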
Toward Smart Scheduling in Tapis
Stubbs, Joe, Padhy, Smruti, Cardone, Richard
The Tapis framework provides APIs for automating job execution on remote resources, including HPC clusters and servers running in the cloud. Tapis can simplify the interaction with remote cyberinfrastructure (CI), but the current services require users to specify the exact configuration of a job to run, including the system, queue, node count, and maximum run time, among other attributes. Moreover, the remote resources must be defined and configured in Tapis before a job can be submitted. In this paper, we present our efforts to develop an intelligent job scheduling capability in Tapis, where various attributes about a job configuration can be automatically determined for the user, and computational resources can be dynamically provisioned by Tapis for specific jobs. We develop an overall architecture for such a feature, which suggests a set of core challenges to be solved. Then, we focus on one such specific challenge: predicting queue times for a job on different HPC systems and queues, and we present two sets of results based on machine learning methods. Our first set of results casts the problem as a regression, which can be used to select the best system from a list of existing options. Our second set of results frames the problem as a classification, allowing us to compare the use of an existing system with a dynamically provisioned resource.
- North America > United States > Texas > Travis County > Austin (0.15)
- North America > United States > Texas > Shelby County > Center (0.05)
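The two framings in the abstract reduce to a simple decision rule once queue-time predictions are in hand: pick the existing system with the lowest predicted wait, unless even that wait exceeds the time to dynamically provision a fresh resource. The sketch below is our own illustration of that selection step, with hypothetical system names and times; it is not Tapis's implementation.

```python
def choose_resource(predicted_queue_s: dict, provision_time_s: float):
    """Select where to run a job: ("existing", system) if some queue's
    predicted wait beats dynamic provisioning, else ("provision", None)."""
    best_system = min(predicted_queue_s, key=predicted_queue_s.get)
    if predicted_queue_s[best_system] <= provision_time_s:
        return ("existing", best_system)
    return ("provision", None)

# Hypothetical predicted queue times (seconds) from a trained regressor.
short_waits = {"system-a": 300.0, "system-b": 1200.0}
long_waits = {"system-a": 900.0, "system-b": 1200.0}
print(choose_resource(short_waits, provision_time_s=600.0))
print(choose_resource(long_waits, provision_time_s=600.0))
```

The regression framing supplies `predicted_queue_s`; the classification framing collapses the same comparison into a single provision/no-provision label.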
I/O in Machine Learning Applications on HPC Systems: A 360-degree Survey
Lewis, Noah, Bez, Jean Luca, Byna, Suren
Because of the increased popularity of Machine Learning (ML) workloads, there is a rising demand for I/O systems that can effectively accommodate their distinct I/O access patterns. Write operation bursts commonly dominate traditional workloads; however, ML workloads are usually read-intensive and use many small files [99]. Due to the absence of a well-established consensus on the preferred I/O stack for ML workloads, numerous developers resort to crafting their own ad-hoc algorithms and storage systems to cater to the specific requirements of their applications [50]. This can result in sub-optimal application performance due to the under-utilization of the storage system, prompting the necessity for novel I/O optimization methods tailored to the demands of ML workloads. In Figure 1, we show the evolving I/O stack used for running ML workloads (on the right side) in comparison with the traditional HPC I/O stack (on the left side). The traditional HPC I/O stack has been developed to support massive parallelism.
- North America > United States > New York > New York County > New York City (0.14)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- North America > United States > Washington > King County > Renton (0.04)
- (15 more...)
- Energy (0.93)
- Information Technology > Services (0.67)
- Government > Regional Government > North America Government > United States Government (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)
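The read-intensive pattern the survey describes comes largely from training loops that re-read the entire dataset in a freshly shuffled order every epoch. A minimal sketch of that access pattern, with invented sample counts, contrasts it with the write-burst pattern of traditional HPC checkpointing:

```python
import random

def epoch_read_order(num_samples: int, epochs: int, seed: int = 0):
    """Sketch of an ML dataloader's I/O pattern: every epoch re-reads the
    whole dataset in a newly shuffled order, producing many small random
    reads rather than large sequential write bursts."""
    rng = random.Random(seed)
    order = []
    for _ in range(epochs):
        indices = list(range(num_samples))
        rng.shuffle(indices)  # random access order, different each epoch
        order.extend(indices)
    return order

reads = epoch_read_order(num_samples=4, epochs=2)
print(len(reads))  # → 8: every sample read once per epoch
```

Each index here would correspond to opening and reading one small sample file, which is exactly the workload shape that strains POSIX-era HPC file systems.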
KPIs-Based Clustering and Visualization of HPC jobs: a Feature Reduction Approach
Halawa, Mohamed Soliman, Díaz-Redondo, Rebeca P., Fernández-Vilas, Ana
High-Performance Computing (HPC) systems need to be constantly monitored to ensure their stability. The monitoring systems collect a tremendous amount of data about different parameters or Key Performance Indicators (KPIs), such as resource usage, IO waiting time, etc. A proper analysis of this data, usually stored as time series, can provide insight in choosing the right management strategies as well as the early detection of issues. In this paper, we introduce a methodology to cluster HPC jobs according to their KPI indicators. Our approach reduces the inherent high dimensionality of the collected data by applying two techniques to the time series: literature-based and variance-based feature extraction. We also define a procedure to visualize the obtained clusters by combining the two previous approaches and Principal Component Analysis (PCA). Finally, we have validated our contributions on a real data set, concluding that the KPIs related to CPU usage provide the best cohesion and separation for clustering analysis and that our visualization methodology yields good results.
- Europe (0.28)
- Africa > Middle East > Egypt (0.14)
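The pipeline sketched in the abstract, reducing each job's high-dimensional KPI features and then projecting onto two principal components for plotting, can be illustrated as below. This is a generic variance-based selection plus SVD-based PCA, assuming numpy; the paper's actual literature-based feature extraction involves additional steps not shown here.

```python
import numpy as np

def variance_based_features(X: np.ndarray, top_k: int) -> np.ndarray:
    """Keep only the top_k columns (features) with the highest variance,
    a simple form of variance-based feature reduction."""
    keep = np.argsort(X.var(axis=0))[::-1][:top_k]
    return X[:, keep]

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project rows (jobs) onto the first two principal components
    for a 2-D cluster visualization."""
    Xc = X - X.mean(axis=0)          # center the data
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T             # coordinates in the top-2 PC basis

# Hypothetical data: 10 jobs described by 6 KPI-derived features.
rng = np.random.default_rng(0)
jobs = rng.normal(size=(10, 6))
reduced = variance_based_features(jobs, top_k=3)
coords = pca_2d(reduced)             # (10, 2) points ready for a scatter plot
```

The resulting 2-D coordinates are what one would feed to a scatter plot colored by cluster label.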
Unsupervised KPIs-Based Clustering of Jobs in HPC Data Centers
Halawa, Mohamed S., Díaz-Redondo, Rebeca P., Fernández-Vilas, Ana
Performance analysis is an essential task in High-Performance Computing (HPC) systems and it is applied for different purposes such as anomaly detection, optimal resource allocation, and budget planning. HPC monitoring tasks generate a huge number of Key Performance Indicators (KPIs) to supervise the status of the jobs running in these systems. KPIs give data about CPU usage, memory usage, network (interface) traffic, or other sensors that monitor the hardware. Analyzing this data, it is possible to obtain insightful information about running jobs, such as their characteristics, performance, and failures. The main contribution of this paper is to identify which metrics (KPIs) are the most appropriate to classify different types of jobs according to their behavior in the HPC system. With this aim, we have applied different clustering techniques (partition and hierarchical clustering algorithms) using a real dataset from the Galician Computation Center (CESGA). We have concluded that (i) the metrics (KPIs) related to network (interface) traffic monitoring provide the best cohesion and separation to cluster HPC jobs, and (ii) hierarchical clustering algorithms are the most suitable for this task. Our approach was validated using a different real dataset from the same HPC center.
- Europe > Spain (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
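The hierarchical clustering the paper favors can be illustrated with a naive single-linkage agglomerative sketch: start with every job as its own cluster and repeatedly merge the two closest clusters until k remain. The toy 2-D points below stand in for KPI feature vectors; this is our own illustration, not the paper's implementation.

```python
def single_linkage(points, k):
    """Naive agglomerative (single-linkage) clustering down to k clusters.
    Cluster distance = minimum pairwise Euclidean distance between members."""
    clusters = [[p] for p in points]

    def dist(a, b):
        return min(sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
                   for p in a for q in b)

    while len(clusters) > k:
        # Find and merge the two closest clusters.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

# Toy "jobs": two tight groups of KPI feature vectors.
jobs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
clusters = single_linkage(jobs, k=2)
print([len(c) for c in clusters])
```

Production analyses would use an optimized implementation (e.g. scipy's `linkage`), since this O(n^3)-ish version is only for exposition.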
Benchmarking Performance of Deep Learning Model for Material Segmentation on Two HPC Systems
Williams, Warren R., Glandon, S. Ross, Morris, Luke L., Cheng, Jing-Ru C.
Performance benchmarking of HPC systems is an ongoing effort that seeks to provide information that will allow for increased performance and improve the job schedulers that manage these systems. We develop a benchmarking tool that utilizes machine learning models and gathers performance data on GPU-accelerated nodes while they perform material segmentation analysis. The benchmark uses an ML model that has been converted from Caffe to PyTorch using the MMdnn toolkit and the MINC-2500 dataset. Performance data is gathered on two ERDC DSRC systems, Onyx and Vulcanite. The data reveals that while Vulcanite achieves faster model times in a large number of benchmarks, it is also more subject to environmental factors that can cause performance slower than Onyx; in contrast, the model times from Onyx are consistent across benchmarks.
- North America > United States > Mississippi > Warren County > Vicksburg (0.04)
- North America > United States > Florida > Alachua County > Gainesville (0.04)
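Collecting per-batch model times of the kind the benchmark reports can be sketched as a small timing harness that discards warmup iterations, which on GPU nodes absorb one-off initialization costs. The callable below is a stub; a real run would wrap a PyTorch forward pass. This harness is our illustration, not the paper's tool.

```python
import time

def time_inference(model_fn, batches, warmup=2):
    """Wall-clock time per batch for a model callable, skipping the first
    `warmup` iterations (which typically include JIT/driver initialization)."""
    times = []
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        model_fn(batch)
        elapsed = time.perf_counter() - start
        if i >= warmup:
            times.append(elapsed)
    return times

def dummy_model(batch):
    return sum(batch)  # stand-in for a segmentation model's forward pass

timings = time_inference(dummy_model, [[1, 2, 3]] * 5, warmup=2)
print(len(timings))  # → 3 timed batches after 2 warmup iterations
```

Comparing the spread of these per-batch times across systems is what surfaces the environmental variability observed on Vulcanite versus the consistency on Onyx.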
Online Job Failure Prediction in an HPC System
Antici, Francesco, Borghesi, Andrea, Kiziltan, Zeynep
Modern High Performance Computing (HPC) systems are complex machines, with major impacts on economy and society. Along with their computational capability, their energy consumption is also steadily rising, representing a critical issue given the ongoing environmental and energy crisis. Therefore, developing strategies to optimize HPC system management is of paramount importance, both to guarantee top-tier performance and to improve energy efficiency. One strategy is to act at the workload level and highlight the jobs that are most likely to fail, prior to their execution on the system. Jobs failing during their execution unnecessarily occupy resources which could delay other jobs, adversely affecting the system performance and energy consumption. In this paper, we study job failure prediction at submit-time using classical machine learning algorithms. Our novelty lies in (i) the combination of these algorithms with Natural Language Processing (NLP) tools to represent jobs and (ii) the design of the approach to work in an online fashion in a real system. The study is based on a dataset extracted from a production machine hosted at the HPC centre CINECA in Italy. Experimental results show that our approach is promising.
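One common way to turn submit-time job text (script contents, resource flags) into fixed-size features for a classifier is feature hashing. The abstract does not specify the paper's exact NLP encoding, so the sketch below is an illustrative stand-in, not their method.

```python
import hashlib

def hashed_bow(text: str, dim: int = 32):
    """Fixed-size bag-of-words vector via feature hashing: each token is
    hashed into one of `dim` buckets, giving a representation that works
    online (no vocabulary needs to be fixed in advance)."""
    vec = [0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1
    return vec

# Hypothetical submit-time record flattened to text.
features = hashed_bow("sbatch --nodes=4 --time=02:00:00 python train.py")
print(sum(features))  # → 5, one count per token
```

Because the feature space never grows, the same function can encode every newly submitted job in an online deployment and feed it to any classical classifier.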
Development of Authenticated Clients and Applications for ICICLE CI Services -- Final Report for the REHS Program, June-August, 2022
Samar, Sahil, Chen, Mia, Karpinski, Jack, Ray, Michael, Sarin, Archita, Garcia, Christian, Lange, Matthew, Stubbs, Joe, Thomas, Mary
The Artificial Intelligence (AI) institute for Intelligent Cyberinfrastructure with Computational Learning in the Environment (ICICLE) is funded by the NSF to build the next generation of Cyberinfrastructure to render AI more accessible to everyone and drive its further democratization in the larger society. We describe our efforts to develop Jupyter Notebooks and Python command line clients that would access these ICICLE resources and services using ICICLE authentication mechanisms. To connect our clients, we used Tapis, which is a framework that supports computational research to enable scientists to access, utilize, and manage multi-institution resources and services. We used Neo4j to organize data into a knowledge graph (KG). We then hosted the KG on a Tapis Pod, which offers persistent data storage with a template made specifically for Neo4j KGs. In order to demonstrate the capabilities of our software, we developed several clients: Jupyter notebooks authentication, Neural Networks (NN) notebook, and command line applications that provide a convenient frontend to the Tapis API. In addition, we developed a data processing notebook that can manipulate KGs on the Tapis servers, including creation of a KG, data upload, and modification. In this report we present the software architecture, design and approach, the success of our client software, and future work.
- North America > United States > California > San Diego County > San Diego (0.05)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- (4 more...)
- Information Technology > Security & Privacy (0.73)
- Education > Educational Setting (0.47)